Analysis of the values in the LSI Term-Term Matrix
Abstract
Singular value decomposition (SVD), the process at the heart of Latent Semantic Indexing (LSI), is a computationally expensive procedure. In this paper we analyze the relationship between higher-order term co-occurrence and the values produced by the LSI process. We show a strong correlation between the number of co-occurrence paths and the value produced in the LSI term-term matrix.
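As a rough illustration of the two quantities the abstract relates, the sketch below builds a rank-k LSI term-term matrix from a truncated SVD and counts first- and second-order co-occurrence links between terms. It is a minimal sketch, not the paper's procedure: the toy matrix, the function names `lsi_term_term` and `cooccurrence_paths`, and the choice to count a "path" as a chain of terms linked by shared documents are all assumptions made for this example.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows: terms, columns: documents).
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)


def lsi_term_term(A, k):
    """Rank-k term-term matrix U_k S_k^2 U_k^T from the SVD of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    # Scale each retained column of U by its squared singular value.
    return (Uk * sk**2) @ Uk.T


def cooccurrence_paths(A):
    """Return (first, second): first-order links and second-order path counts.

    Assumption: two terms are first-order linked if they share a document;
    second[i, j] counts intermediate terms m linked to both i and j.
    """
    B = (A > 0).astype(int)
    first = ((B @ B.T) > 0).astype(int)
    np.fill_diagonal(first, 0)
    second = first @ first
    np.fill_diagonal(second, 0)
    return first, second


T2 = lsi_term_term(A, k=2)
first, second = cooccurrence_paths(A)

# Terms 0 and 2 never share a document, yet the rank-2 matrix still
# assigns the pair a value; compare it with the second-order path count.
print("first-order link (0, 2):", first[0, 2])
print("second-order paths (0, 2):", second[0, 2])
print("LSI term-term value (0, 2):", T2[0, 2])
```

With k equal to the full rank, `lsi_term_term` reproduces the plain product of A with its transpose, so any value assigned to pairs that never co-occur directly comes from the truncation step. On a real collection one could compare the path counts with the truncated values across all term pairs, for example with `np.corrcoef`.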
Similar resources
Detecting Patterns in the LSI Term-Term Matrix
Higher order co-occurrences play a key role in the effectiveness of systems used for text mining. A wide variety of applications use techniques that explicitly or implicitly employ a limited degree of transitivity in the co-occurrence relation. In this work we show use of higher orders of co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, on the systems that re...
A Mathematical View of Latent Semantic Indexing: Tracing Term Co-occurrences
Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI’s use of higher orders of co-occurrence is a critical component of this study. In this work we present...
Assessing the Impact of Sparsification on LSI Performance
We describe an approach to information retrieval using Latent Semantic Indexing (LSI) that directly manipulates the values in the Singular Value Decomposition (SVD) matrices. We convert the dense term by dimension matrix into a sparse matrix by removing a fixed percentage of the values. We present retrieval and runtime performance results, using seven collections, which show that using this tec...
A Latent Semantic Structure Model for Text Classification
Latent Semantic Indexing (LSI) has been successfully applied to information retrieval and classification. LSI can deal with the problems of polysemy and synonymy, and can reduce noise in the raw document-term matrix. However, LSI may ignore important features for some small categories because they are not the most important features for all the document collection. In this paper, we describe a ...
Improving the Banks Shareholder Long Term Values by Using Data Envelopment Analysis Model
Given the rapid development of the banking sector, it is reasonable to expect that the performance of banks has become the centre of attention among bank managers, stakeholders, policy makers, and regulators. In order to maximize the shareholders’ satisfactory level, two bank efficiency measurement approaches, i.e. the production approach and the user cost approach, which are financial evalu...